1. INTRODUCTION

A business process is a structured set of dependent activities within a company or business that
follow some logical order with the aim of creating a desired result [1]. Business process management systems
(BPMS) arose as a solution to support different aspects of business processes between organizations in the 21st
century [2]. Whenever a business process is executed, the BPMS produces a process log or event log [3].
These event logs are also known as the "audit trail", "transaction log" or "history" of a business process [4], [5].
Business process event logs are usually stored in extensible markup language (XML) format [6]. The first
standard for event logs, mining extensible markup language (MXML), was introduced in 2003 and was updated to
extensible event stream (XES) in 2009. XES became an IEEE standard in 2016.

Aalst et al. [6] suggest an approach other than interview techniques for obtaining insights from
business processes, namely analysing the event logs produced by a BPMS. However, it is hard to apply
classical data mining techniques or statistical analysis to these event logs because of the semi-structured
nature of the XML format [7]. Accordingly, different methods and algorithms, such as frequent subtree mining
(FSM) [8]-[10], have been developed to extract information from these XML documents. FSM looks for
patterns in a tree-structured database using a support value determined by the user. In other words, FSM is
association rule mining on a tree-structured database.

Nevertheless, there are a few drawbacks to using FSM to gain insights from business process log data.
Firstly, FSM methods usually ignore or do not account for node positional information [11]. This results in
some information loss when extracting information from business process log data, and this
positional information may be substantial in some application scenarios. Secondly, the support value is the only
measure used in FSM. It is difficult to find novel or interesting patterns when the support value is set very
low, because a huge number of rules is generated [12]. Furthermore, most researchers focus on
improving the performance of FSM, including finding ways to reduce average run time and memory usage [7].
Thus, whether the results acquired from FSM are meaningful or interesting is unknown. Statistical analysis is one
of many ways to overcome the limitations of FSM. Statistical analysis such as the chi-square test and regression
can help to filter out irrelevant variables and so reduce the number of meaningless and uninteresting rules [13].

In this study, a model is proposed that allows statistical approaches to be applied to business process
log data, specifically in XML format. The performance of a business process may differ because of factors
such as customer type, product type, geography and time, even though the processes are identical. Usually,
process variant analysis [14] is conducted to find the differences between business processes. However,
when analysing business process log data, a statistical test such as the t-test can be used as an alternative
to determine whether there is a difference in production between two instances of the same business process
or two identical machines in a production line. Therefore, the t-test is proposed in this study to determine
whether there is a difference between instances of the same process and which factors influence the outcome
of a business process. Two sets of business process log data, one simulated and one real-life, are used in this study.


2. RELATED WORKS 

In the era of big data, analytics are widely performed to obtain insights and knowledge that improve
business processes and allow better decisions to be made. Analyzing the event logs produced by BPMS is
therefore an important topic. In practice, observing or analyzing a whole population is often
difficult. The same difficulty arises with event logs, especially when the
population is large. LogRank [15] was devised to sample large-scale business process log data down to a
smaller size so that business process discovery can be performed more easily. Gaaloul et al. [16] proposed
applying statistical analysis to workflow logs. Their statistical dependency tables (SDT) model
can determine event dependencies by analyzing business process log data statistically. LogLens, presented by
Debnath et al. [17], can detect anomalies in event logs in real time. Log delta analysis,
proposed by Beest et al. [18], can identify behavioral differences between two business process logs.

Usually, business process log data are stored in XML format [3]. One reason these
logs are stored in this semi-structured format is the ability of XML to represent the contextual
information among different attributes or metadata in a domain in an unambiguous way. Nevertheless, it is quite
challenging to apply statistical analysis and data mining techniques to XML data because of its complex data
structure and dimensions (structure dimension and content dimension) [19]. Because of the similar characteristics of
XML documents and tree-structured data, many researchers model an XML document as an ordered, labelled and
rooted tree. Frequent subtree mining is the most widely used technique for analyzing XML documents. Different
FSM algorithms, such as [8]-[10], have been developed to improve the
performance of FSM and reduce memory usage and total running time. However, the information extracted from
XML documents using FSM is limited. Because the minimum support set by the user is the only measure used
in FSM, its results are limited to the most common trees found in the dataset. Thus, only the common structure
of the XML can be determined. Besides FSM, there are other mining algorithms that look for the most frequent rules
in XML documents, such as pre-order linked WAP-tree mining (PLWAP) [20], combination based
behavioral pattern mining (COBPAM) [21] and frequent pattern mining [22].

To overcome the downsides of FSM, [23] proposed the database structure model (DSM) to flatten XML
data so that more data mining techniques can be performed on XML documents. Researchers
such as [3] and [24] performed clustering or classification on XML documents, including business process
log data. Their research shows that applying DSM to XML data gives better results than FSM.
Shaharanee et al. [13] and Shaharanee and Jamil [25] suggest that performing correlation analysis can filter out
unrelated variables in an XML document and so improve the interestingness of the results of
analytics or data mining. However, a weakness of DSM is that it assumes all tree structures in an XML document are
the same. Moreover, attributes are ignored when using DSM, although the structural information is preserved.


3. PROPOSED FRAMEWORK AND MODEL 
3.1. Mining business process log framework 

A framework [26] is used in this study to mine business process log data. Figure 1 illustrates the
framework. Firstly, raw business process log data in XML format is pre-processed. Data that is corrupted
or not related to the transactions is filtered out. Then, data is extracted and
converted into structured data using the flatten sequential structure model (FSSM). Finally, statistical analysis such
as a t-test is applied to determine the difference between two groups of data.


[Figure 1 diagram: five phases. Phase 1, pre-processing of XML data (extract, transform, load (ETL); data
filtering), takes raw data to tree-structured data. Phase 2, FSSM extraction, converts tree-structured data into
semi-structured flat data. Phase 3, FSSM conversion, converts flat data to structured data. Phase 4, knowledge
discovery, applies data mining methods and statistical analysis (outlier detection; prediction performance;
bottleneck detection) to the structured data. Phase 5, interpretation, yields the knowledge model.]

Figure 1. Mining business process log framework 


3.2. Flatten sequential structure model 

FSSM is proposed in this study to convert XML data into structured data so that more data
mining algorithms and statistical approaches can be applied to obtain insights from these semi-structured
documents. FSSM is divided into two phases, an extraction phase and a conversion phase. The data structure and
algorithm in each phase of FSSM are explained in detail in this section. A synthetic tree database T is shown
in Figure 2 for explanatory purposes. Two rooted, ordered, labelled trees with different structures, labelled t1 and
t2, are illustrated in Figure 2 to show how tree-structured data is converted into structured format.




Figure 2. Tree-structured database T. 


3.3. Data structure in FSSM 

The first phase of FSSM records the structural properties of every instance in a tree database. Put
differently, the structural information is preserved in the FSSM extraction phase. Table 1 shows the
tree-structured database T flattened with its structural information preserved. Each
tree is traversed from top to bottom, then left to right. Node 'a' is the root of trees t1 and t2. Before
proceeding to the next sibling, a backtrack to the parent of the current node is required; '-1' in Table 1
denotes such a backtrack, and '0' pads a row that is shorter than the longest one. Next, the FSSM conversion
phase converts the flattened data from semi-structured format to structured format. Table 2 illustrates the
flattened data from the FSSM extraction phase after conversion into structured format.


Table 1. FSSM extraction phase flatten data
T    X0   X1   X2   X3   X4   X5   X6
t1   a    b    -1   c    -1   d    -1
t2   a    b    c    -1   -1   d    0

Table 2. FSSM conversion phase structured data
     a    b    c    d
t1   t1a  t1b  t1c  t1d
t2   t2a  t2b  t2c  t2d
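
The flattening idea behind Table 1 can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the tuple-based tree layout, the function names `encode` and `pad`, and the second tree `small` are all assumptions introduced here. Tree `t1` is the root 'a' with leaf children 'b', 'c' and 'd', as implied by row t1 of Table 1.

```python
from typing import List, Tuple

# A tree node is (label, children); this tuple layout is an assumption for illustration.
Tree = Tuple[str, list]

def encode(tree: Tree) -> List[str]:
    """Pre-order walk that emits '-1' each time it returns from a child to its parent."""
    label, children = tree
    sequence = [label]
    for child in children:
        sequence.extend(encode(child))
        sequence.append("-1")  # backtrack before moving to the next sibling
    return sequence

def pad(sequence: List[str], width: int) -> List[str]:
    """Right-pad a shorter sequence with '0' so every row has the same width."""
    return sequence + ["0"] * (width - len(sequence))

t1 = ("a", [("b", []), ("c", []), ("d", [])])  # root a with three leaf children
small = ("a", [("b", [])])                     # hypothetical shorter tree, not from the paper

rows = [encode(t1), encode(small)]
width = max(len(row) for row in rows)
flat = [pad(row, width) for row in rows]

print(flat[0])  # ['a', 'b', '-1', 'c', '-1', 'd', '-1']
print(flat[1])  # ['a', 'b', '-1', '0', '0', '0', '0']
```

The first printed row reproduces row t1 of Table 1; the second shows how zero padding brings a shorter tree up to the common width.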


Enabling efficient business process mining using flatten sequential structure model ... (Ang Jin Sheng) 


ISSN: 2502-4752


3.4. FSSM algorithms 

Algorithm 1 shows the pseudocode for finding the maximum number of variables among transactions or
subtrees. The longest chain, or longest tree, must be determined before FSSM starts, so that the table
for the first phase can be drafted using the length and variables of the longest tree.


Algorithm 1. Finding maximum number of variables
Input: XML format dataset
Output: Longest Node in dataset
Define variable:
    Let Maximum = First Node
    Let Maximum Count = 0
    Let Maximum Node = First Transaction
    Let Maximum Variable = empty list
for (each transaction in a tree)
    let node level = 1, count = 0 and variable = empty
    while (elements not equal 0)
        if (node have child)
            elements --
            node level ++
            variable = variable + element
        if (node do not have child)
            node level --
            if (node level equal 0)
                exit loop
            else
                variable = variable + backtrack
        count ++
    if (count > Maximum Count)
        Maximum Count = count
        Maximum Variable = variable
        Maximum Node = transaction
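
As a rough companion to Algorithm 1, the sketch below finds the transaction whose flattened row would be longest. It is not the authors' code: it assumes the backtrack encoding described for Table 1, under which a subtree with n nodes needs 2n - 1 cells (one per node plus one backtrack per non-root node), so it uses that closed form instead of walking with counters. The function names and the toy XML are hypothetical.

```python
import xml.etree.ElementTree as ET
from typing import Tuple

def encoded_length(node: ET.Element) -> int:
    """Cells needed for a subtree: one per node plus one backtrack per non-root node."""
    total_nodes = sum(1 for _ in node.iter())
    return 2 * total_nodes - 1

def longest_transaction(log_root: ET.Element) -> Tuple[int, int]:
    """Return (index, length) of the transaction whose flattened row is longest."""
    best_index, best_length = -1, 0
    for index, transaction in enumerate(log_root):
        length = encoded_length(transaction)
        if length > best_length:
            best_index, best_length = index, length
    return best_index, best_length

# Toy log; element names are illustrative and not taken from a real event log.
xml_text = "<log><t><a/><b/></t><t><a><c/></a><b/><d/></t></log>"
root = ET.fromstring(xml_text)
result = longest_transaction(root)
print(result)  # (1, 9)
```

The returned length gives the number of columns needed for the first-phase table.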


Algorithm 2 illustrates FSSM phase 1, the FSSM extraction phase. XML data is extracted and flattened
in this phase. The data structure resulting from the FSSM extraction phase is shown in Table 1. Algorithm 3 shows
the pseudocode for FSSM phase 2, FSSM conversion, in which the flattened data is converted to structured format.
An example of the outcome of the FSSM conversion phase is illustrated in Table 2.


Algorithm 2. FSSM extraction (Phase 1)
Input: XML format dataset
Output: Flatten data
Define variable:
    Let total column of FSSM table = length of Maximum Node from Algorithm 1
    Let FSSM table = empty
    Let structure table variable = empty
for (each transaction in a tree)
    let node level = 1, column number = 1, FSSM row = empty
    while (elements not equal 0)
        if (node have child)
            elements --
            node level ++
            column number of FSSM row ++
            if (node contains attribute)
                add node name and attribute into FSSM row
                add attribute name into structure table variable
            else
                add node name into FSSM row
        if (node do not have child)
            node level --
            if (node level equal 0)
                exit loop
            else
                column number of FSSM row ++
                add '-1' or 'b' into FSSM row
    while (column number of current transaction < total column of FSSM table)
        column number of FSSM row ++
        add 0 into FSSM row
    add FSSM row into FSSM table
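
The extraction phase can be sketched in Python as follows. This is an illustration under stated assumptions rather than the authors' implementation: it assumes XES-style child elements that carry `key` and `value` attributes, writes them as `key-value` cells, uses the tag name for elements without those attributes, and uses 'b' as the backtrack marker. The helper names `token`, `flatten` and `extract` are hypothetical.

```python
import xml.etree.ElementTree as ET
from typing import List

BACKTRACK = "b"

def token(node: ET.Element) -> str:
    """Use 'key-value' for attribute-carrying elements, otherwise the tag name."""
    key, value = node.get("key"), node.get("value")
    if key is not None and value is not None:
        return f"{key}-{value}"
    return node.tag

def flatten(node: ET.Element) -> List[str]:
    """Pre-order walk with a backtrack marker after each child subtree."""
    row = [token(node)]
    for child in node:
        row.extend(flatten(child))
        row.append(BACKTRACK)
    return row

def extract(log_root: ET.Element) -> List[List[str]]:
    """Flatten every transaction and zero-pad rows to the longest one."""
    rows = [flatten(trace) for trace in log_root]
    width = max((len(row) for row in rows), default=0)
    return [row + ["0"] * (width - len(row)) for row in rows]

xml_text = """
<log>
  <trace>
    <string key="concept:name" value="case_67"/>
    <event>
      <string key="concept:name" value="ActivityA"/>
    </event>
  </trace>
  <trace>
    <string key="concept:name" value="case_68"/>
  </trace>
</log>
""".strip()

table = extract(ET.fromstring(xml_text))
for row in table:
    print(row)
```

Each printed row corresponds to one transaction, with shorter transactions padded by '0' cells on the right.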

Algorithm 3. FSSM conversion (Phase 2)
Input: Flatten data from FSSM extraction
Output: Structured data
Filter only unique attribute names from the structure table variable list from Algorithm 2
Put the unique attribute names into the columns of the FSSM structured table
for (each transaction in FSSM flatten table)
    let node level = 0 and structure table = empty
    for (each variable in FSSM flatten table)
        if (FSSM table variable value == 'b' or FSSM table variable value == '-1')
            node level --
        else
            if (FSSM table variable value not equal '0')
                node level ++
        if (node level > 1)
            if (FSSM table variable contain attributes)
                put the attribute value under the matching attribute / column name in the structure table
        if (structure table contains any value)
            if (node level equal 1)
                put 0 into all attributes / columns which do not contain any value
                add transaction number into structure table
                add structure table into the complete table
                empty the structure table
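
The conversion phase can be sketched along the same lines. The code below is a hedged illustration, not the authors' implementation: it assumes each attribute cell has the form `key-value` and that keys contain no hyphen, so the cell can be split at the first '-'. The function name `convert`, the `transaction` column and the `T1` numbering are hypothetical.

```python
from typing import Dict, List

def convert(flat_rows: List[List[str]], columns: List[str]) -> List[Dict[str, str]]:
    """Turn flattened rows into structured records, one per subtree under the root."""
    structured = []
    for number, flat_row in enumerate(flat_rows, start=1):
        level, record = 0, {}
        for cell in flat_row:
            if cell in ("b", "-1"):
                level -= 1                      # backtrack marker
            elif cell != "0":                   # '0' is padding and is skipped
                level += 1
                key, sep, value = cell.partition("-")
                if level > 1 and sep and key in columns:
                    record[key] = value
            if record and level == 1:
                # Back directly under the root: emit the record, fill gaps with '0'.
                row = {"transaction": f"T{number}"}
                row.update({name: record.get(name, "0") for name in columns})
                structured.append(row)
                record = {}
    return structured

flat = [[
    "trace", "concept:name-case_67", "b",
    "event", "concept:name-ActivityA", "b",
    "time:timestamp-1970-01-01T07:30:00+07:30", "b", "b",
]]
result = convert(flat, ["concept:name", "time:timestamp"])
for row in result:
    print(row)
```

For the sample row this yields one record for the case name (with the timestamp filled as '0') and one record for the event, matching the row pattern visible in the structured data for the simulation dataset.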


4. PROCEDURE 
4.1. Procedure employed 
A brief description of the procedure used in this study to perform the t-test on XML business process event logs
is given below:
- Firstly, raw data in XML format is cleaned and filtered. Only data with transaction information is
retained and used in this study.
- Then, the tree-structured data is flattened in the FSSM extraction phase.
- The FSSM conversion phase converts the flattened data into structured data.
- A t-test is conducted on the converted structured data. The null hypothesis is rejected if the p-value is less
than 0.05. Otherwise, the null hypothesis is not rejected.
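
The decision rule in the last step can be written as a one-line check. The function name `decide` and the default `alpha` argument are introduced here for illustration only.

```python
def decide(p_value: float, alpha: float = 0.05) -> str:
    """Reject the null hypothesis only when the p-value is below the significance level."""
    return "reject H0" if p_value < alpha else "fail to reject H0"

print(decide(0.03))   # reject H0
print(decide(0.057))  # fail to reject H0
```

A p-value of 0.057, the value reported later for the BPIC 2017 data, falls just above the threshold and therefore does not lead to rejection.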


4.2. Hypothesis testing 
In this study, a two independent sample t-test is used for hypothesis testing. For the BPIC 2017 dataset,
the null hypothesis is that there is no difference in the amount requested between applications
that are accepted and those that are rejected. For the simulated data, the null hypothesis is that there is
no difference between activity C and activity D.
The steps of the t-test are as follows:
a) Set the null and alternative hypotheses:
$H_0: \mu_1 = \mu_2$
$H_1: \mu_1 \neq \mu_2$
b) Use (1) to calculate the t-statistic, where $\bar{x}_1$ and $\bar{x}_2$ are the means of sample 1 and sample 2
respectively; $n_1$ and $n_2$ are the sample sizes of group 1 and group 2 respectively; and $s_1$ and $s_2$ are the
standard deviations of group 1 and group 2 respectively.
c) Compute the p-value by comparing the t-statistic with the t-distribution.
d) If the p-value is less than 0.05, reject the null hypothesis.


$t = \frac{\bar{x}_1 - \bar{x}_2}{\sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}}$ (1)
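
The t-statistic in step b) can be computed with the Python standard library. This sketch assumes the unpooled (Welch) form of the two-sample statistic, which uses the separate standard deviations of the two groups; the study itself ran the test in SPSS, so the function name `welch_t` and the sample values are illustrative only. The p-value step is omitted because it needs a t-distribution routine from outside the standard library.

```python
from math import sqrt
from statistics import mean, variance
from typing import Sequence, Tuple

def welch_t(sample1: Sequence[float], sample2: Sequence[float]) -> Tuple[float, float]:
    """Return the unpooled two-sample t-statistic and its approximate degrees of freedom."""
    n1, n2 = len(sample1), len(sample2)
    v1, v2 = variance(sample1), variance(sample2)  # sample variances s1^2 and s2^2
    standard_error = sqrt(v1 / n1 + v2 / n2)
    t = (mean(sample1) - mean(sample2)) / standard_error
    # Welch-Satterthwaite approximation for the degrees of freedom
    df = (v1 / n1 + v2 / n2) ** 2 / ((v1 / n1) ** 2 / (n1 - 1) + (v2 / n2) ** 2 / (n2 - 1))
    return t, df

t, df = welch_t([1, 2, 3, 4, 5], [2, 4, 6, 8, 10])
print(round(t, 4), round(df, 2))  # -1.8974 5.88
```

The statistic and degrees of freedom would then be compared against the t-distribution to obtain the p-value used in step d).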


4.3. Summary of the data 
Two sets of data are used in this study: real-life and simulated business process log data. These
datasets are described in detail below.


4.3.1. Simulated data 

The simulated dataset was generated using processes and logs generator 2 (PLG2), developed by [27].
PLG2 is an application that generates random business processes and their event logs. 200 transactions were
generated randomly based on the business process illustrated in Figure 3. Activity C generates a random value
between 500 and 1,200, whereas activity D generates a random value between 800 and 1,200.

Figure 3. Business process simulated in PLG2


4.3.2. Real life data 

Real-life event log data is provided by the business process intelligence challenge (BPIC) 2017. Two
event logs are provided in this challenge, an application log and an offer log [28]. However, only
the application event log is used in this study. The event log contains 26 types of events that can be divided into
3 categories: application state changes, offer state changes and workflow events [29]. There are
around 31,508 transactions in the document provided by BPIC 2017, of which only the first 200 are
used in this study. The process flow for the BPIC 2017 dataset [30] is shown in Figure 4.

Figure 4. Business process of BPIC 2017 dataset


5. RESULTS AND DISCUSSION

Firstly, each dataset is converted to flat data and then to structured data before the t-test is conducted.
The datasets are cleaned and converted using R version 4.0.5. The data is then exported to CSV format and
imported into SPSS version 26 to conduct the t-test. Figure 5 shows a screenshot of the flattened simulation
dataset during FSSM extraction.


[Figure 5 screenshot: spreadsheet of the flattened simulation data with columns x0 to x9, one row per trace.
Example row: trace | concept:name-case_67 | b | event | concept:name-ActivityA | b |
time:timestamp-1970-01-01T07:30:00+07:30 | b | b | event.]


Figure 5. Screenshot of flatten data for simulation dataset 


Figure 6 shows a screenshot of the structured simulation dataset during the FSSM conversion
phase. Figure 7 shows a screenshot of the simulation dataset after filtering for activity C and activity D. The names
of activity C and activity D are recoded as 1 and 2 respectively.
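
The filtering and recoding step can be sketched as follows. The study performed this step in R; the Python below only illustrates the idea, and the dictionary-based row layout, the name `GROUP_CODE` and the function `filter_and_recode` are assumptions. The two production values echo rows visible in the structured simulation data.

```python
from typing import Dict, List

# Structured rows as produced by the conversion phase (layout assumed for illustration).
rows = [
    {"concept:name": "Activity A", "number_of_production": 0},
    {"concept:name": "Activity C", "number_of_production": 853},
    {"concept:name": "Activity B", "number_of_production": 0},
    {"concept:name": "Activity D", "number_of_production": 998},
]

GROUP_CODE = {"Activity C": 1, "Activity D": 2}

def filter_and_recode(records: List[Dict]) -> List[Dict]:
    """Keep only activities C and D and replace their names with group codes 1 and 2."""
    return [
        {"group": GROUP_CODE[r["concept:name"]],
         "number_of_production": r["number_of_production"]}
        for r in records
        if r["concept:name"] in GROUP_CODE
    ]

filtered = filter_and_recode(rows)
group_c = [r["number_of_production"] for r in filtered if r["group"] == 1]
group_d = [r["number_of_production"] for r in filtered if r["group"] == 2]
print(filtered)
```

The two lists `group_c` and `group_d` are then the two samples passed to the t-test.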


[Figure 6 screenshot: structured simulation data with columns Transaction, concept:name, time:timestamp and
number_of_production. Example rows: T7 | Activity A | 1970-01-01T07:30:00+07:30 | 0;
T7 | Activity C | 1970-01-01T08:30:00+07:30 | 853; T9 | case_68 | 0 | 0.]

Figure 6. Structured data for simulation dataset

[Figure 7 screenshot: filtered simulation data with columns Transactions, concept:name (recoded as 1 or 2),
time:timestamp and number_of_production. Example rows: 1 | 1 | 1970-01-01T08:30:00+07:30 | 853;
3 | 2 | 1970-01-01T08:30:00+07:30 | 998.]

Figure 7. Filtered data for simulation dataset


Figure 8 shows a screenshot of the flattened BPIC 2017 data during the FSSM extraction phase.
Figure 9 shows a screenshot of the structured BPIC 2017 data during the FSSM conversion phase.
Figure 10 shows a screenshot of the BPIC 2017 dataset after filtering for the transaction, loan goal, application type,
requested loan amount and acceptance outcome of each transaction. An accepted result of true or false is
converted to 1 or 0 respectively.

[Figure 8 screenshot: flattened BPIC 2017 data with columns x1 to x12 visible, one row per application.
Example row: LoanGoal-Existing loan takeover | b | ApplicationType-New credit | b |
concept:name-Application_652823628 | b | Requeste (cell truncated in the screenshot) | b | event |
Action-Created | b | org:resource-User_1.]

Figure 8. Flatten data for BPIC 2017 dataset

[Figure 9 screenshot: structured BPIC 2017 data. Column headers are truncated in the screenshot: LoanGoal,
Applicati..., concept:n..., Requeste..., Action, org:resou..., EventOrig..., EventID, lifecycle:t..., time:time...,
FirstWith..., NumberO..., Accepted, MonthlyC..., Selected, CreditSco..., OfferedA.... Each row holds the values of one
node, with 0 in every column that does not apply; for example, one offer row reads Created | User_52 | Offer |
complete | 20000 | 44 | TRUE | 498.29 | TRUE | 979 | 20000 in its non-zero cells.]

Figure 9. Structured data for BPIC 2017 dataset

[Figure 10 screenshot: filtered BPIC 2017 data with columns for transaction number, loan goal, application_type,
requested amount (header truncated to "request_a") and accepted. Example rows: 2 | Existing loan takeover |
New credit | 20000; 5 | Car | New credit | 5000. The values of the accepted column are not legible.]

Figure 10. Filtered data for BPIC 2017 dataset

Tables 3 and 4 summarize the simulation and real-life datasets respectively after
filtering for the data required for the t-test. The minimum production value in the simulation data is 501 and
the maximum is 1196. The requested amount in the BPIC 2017 dataset ranges between 5000 and
50000. The means for the simulation dataset and the BPIC 2017 dataset are 856.32 and 16359 respectively.

The results of the t-test are summarized in Table 5. There is a significant difference between the two production
values in the simulation data, as the p-value is less than 0.05. However, the p-value for the BPIC 2017 dataset is
higher than 0.05. Thus, the null hypothesis cannot be rejected: the test finds no significant difference in the
requested amount between accepted and rejected applications in the BPIC dataset. By conducting a t-test on a
business process log, a difference between two processes or outcomes can be determined. Therefore, compared with
FSM, performing statistical tests on business process logs can extract more information, such as differences
between processes and relationships between variables. The limitation that FSM finds frequent subtree patterns
based only on the support set by the user can be overcome by statistical tests, which yield more knowledge from
business process log data.


Table 3. Descriptive statistics for simulation data
Parameter             N    Minimum  Maximum  Mean      Std. Deviation
number_of_production  200  501      1196     856.32    197.708


Table 4. Descriptive statistics for BPIC 2017 data
Parameter             N    Minimum  Maximum  Mean      Std. Deviation
request_amount        200  5000     50000    16359.00  10827.713


Table 5. Results of t-test after using FSSM
Dataset          Hypotheses                                                      p-value  Decision
Simulation data  H0: There is no difference in production value between          0.00     Reject null hypothesis
                 activity C and activity D.
                 H1: There is a difference in production value between
                 activity C and activity D.
BPIC 2017        H0: There is no difference in the amount requested between      0.057    Failed to reject null hypothesis
                 applications that are accepted and rejected.
                 H1: There is a difference in the amount requested between
                 applications that are accepted and rejected.


6. CONCLUSIONS AND FUTURE WORKS 

Extracting information from business process logs, especially in XML format, is usually done using
traditional process mining or frequent subtree mining. FSM can mine information such as frequent
patterns or subtrees in business process log data. However, the interestingness of the results, and further
information such as differences between processes or relationships between variables, cannot be determined using
FSM. As business processes become more complex and more numerous, this paper introduces a model that
enables a wider range of data mining and statistical analysis techniques to be applied to tree-structured business
process logs. Two experiments, on simulated data and real-life data, were carried out to show the promising
capabilities of the proposed method. Data mining techniques such as classification algorithms, or further statistical
tests, can be explored using the proposed framework and model in future research. For example, a relationship
test such as the Pearson correlation test can be used to reduce the number of variables before performing
classification on business process log data.